Code basic models in R
Users can load an Rdata frame which contains three R objects:
You can also find all the data in an Excel sheet.
load("/Users/math90/Documents/Teaching/Ethiopia/OHSI_DataAnalytics_Aug2019/Data/NCI60/CleanGeneMetab_NCI_60.Rdata")
# Look at what objects are in your data:
ls()
## [1] "finalgenes" "finalmetab" "finannot" "smallmetab"
# Look at the dimensions of each:
dim(finalgenes)
## [1] 57 17987
dim(finalmetab)
## [1] 57 349
dim(finannot)
## [1] 57 4
dim(smallmetab)
## [1] 11 8
What information is available for each cell line? What is the summary for each type of variable? What is the size of each data frame? What’s in the rows and columns?
head(finannot)
## cell_line drugscore drugcateg cancertype
## 1 786-0 0.05153939 No Response Renal
## 2 A498 -0.45286203 Resistant Renal
## 3 A549/ATCC -0.41449565 Resistant NSCLC
## 4 ACHN -0.03506064 No Response Renal
## 5 BT-549 -0.34282314 Resistant Breast
## 6 CAKI-1 -0.03703719 No Response Renal
summary(finannot)
## cell_line drugscore drugcateg cancertype
## 786-0 : 1 Min. :-1.09231 No Response:30 Length:57
## A498 : 1 1st Qu.:-0.30818 Resistant :16 Class :character
## A549/ATCC: 1 Median :-0.05460 Sensitive :11 Mode :character
## ACHN : 1 Mean :-0.05754
## BT-549 : 1 3rd Qu.: 0.14492
## CAKI-1 : 1 Max. : 0.81607
## (Other) :51
Double check (you should always do this!) that the IDs in your sample meta data (finannot) match those of the abundance dta (finalgenes, finalmetab)
all.equal(rownames(finalgenes), finannot$cell_line)
## [1] "Modes: character, numeric"
## [2] "Attributes: < target is NULL, current is list >"
## [3] "target is character, current is factor"
# so that didn't work because finalgenes is a matrix and finannot is a data
# frame
all.equal(rownames(finalgenes), as.character(finannot$cell_line))
## [1] TRUE
all.equal(rownames(finalmetab), as.character(finannot$cell_line))
## [1] TRUE
Should we log our data? It’s best to have our samples as columns for much of the plotting so let’s take the transpose first.
set.seed(1)
finalmetab <- t(finalmetab)
finalgenes <- t(finalgenes)
par(mfrow = c(2, 2))
boxplot(as.data.frame(finalmetab), main = "Histogram of Metabolomic Data")
boxplot(as.data.frame(log2(finalmetab)), main = "Histogram of LOG Metabolomic Data")
boxplot(as.data.frame(finalgenes), main = "Histogram of Gene Expression Data")
boxplot(as.data.frame(log2(finalgenes)), main = "Histogram of LOG Gene Expression Data")
finalgenes <- log2(finalgenes)
finalmetab <- log2(finalmetab)
You should be familiar with this data, if not, look at the file “NCI60_HandsOn.html”
Let’s try this on one gene, let’s randomly pick a gene, and run a linear model to look at the association between the drug score and that gene.
First, let’s pick a gene and look at its distribution.
set.seed(1)
gene.index <- sample(1:ncol(finalgenes), 1)
mygene <- finalgenes[gene.index, ]
boxplot(mygene, main = rownames(finalgenes)[gene.index])
Now that we have a gene, let’s run a linear model to look at the association between that gene and drug response score.
mylm <- glm(mygene ~ finannot$drugscore, family = "gaussian")
mylm
##
## Call: glm(formula = mygene ~ finannot$drugscore, family = "gaussian")
##
## Coefficients:
## (Intercept) finannot$drugscore
## 11.5807 0.4104
##
## Degrees of Freedom: 56 Total (i.e. Null); 55 Residual
## Null Deviance: 10.79
## Residual Deviance: 8.978 AIC: 62.41
summary(mylm)
##
## Call:
## glm(formula = mygene ~ finannot$drugscore, family = "gaussian")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.77617 -0.30158 -0.03866 0.26774 0.82468
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.58070 0.05398 214.524 < 2e-16 ***
## finannot$drugscore 0.41039 0.12321 3.331 0.00155 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1632434)
##
## Null deviance: 10.7895 on 56 degrees of freedom
## Residual deviance: 8.9784 on 55 degrees of freedom
## AIC: 62.41
##
## Number of Fisher Scoring iterations: 2
plot(mylm)
# Code to get coefficients:
summary(mylm)$coefficients[, "Estimate"]
## (Intercept) finannot$drugscore
## 11.5806979 0.4103902
summary(mylm)$coefficients["finannot$drugscore", "Estimate"]
## [1] 0.4103902
# Code to get p-values:
summary(mylm)$coefficients[, "Pr(>|t|)"]
## (Intercept) finannot$drugscore
## 4.411216e-82 1.551540e-03
summary(mylm)$coefficients["finannot$drugscore", "Pr(>|t|)"]
## [1] 0.00155154
Note that linear models are not reversible! If you switch the outcome and the covariables, you will not get the same answer!!
Let’s take a look:
flipmylm <- glm(finannot$drugscore ~ mygene, family = "gaussian")
flipmylm
##
## Call: glm(formula = finannot$drugscore ~ mygene, family = "gaussian")
##
## Coefficients:
## (Intercept) mygene
## -4.785 0.409
##
## Degrees of Freedom: 56 Total (i.e. Null); 55 Residual
## Null Deviance: 10.75
## Residual Deviance: 8.948 AIC: 62.22
summary(flipmylm)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.7846528 1.4201953 -3.36901 0.001384099
## mygene 0.4090227 0.1227983 3.33085 0.001551540
summary(mylm)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.5806979 0.0539832 214.52411 4.411216e-82
## finannot$drugscore 0.4103902 0.1232089 3.33085 1.551540e-03
Are there any potential confounders? Let’s go ahead and take those into account.
mylm2 <- glm(mygene ~ finannot$drugscore + finannot$cancertype, family = "gaussian")
mylm2
##
## Call: glm(formula = mygene ~ finannot$drugscore + finannot$cancertype,
## family = "gaussian")
##
## Coefficients:
## (Intercept) finannot$drugscore
## 11.4051 0.3312
## finannot$cancertypeCNS finannot$cancertypeColon
## 0.2645 0.4167
## finannot$cancertypeLeukemia finannot$cancertypeMelanoma
## 0.1420 0.4667
## finannot$cancertypeNSCLC finannot$cancertypeOvarian
## 0.1691 0.1480
## finannot$cancertypeProstate finannot$cancertypeRenal
## -0.1222 -0.2361
##
## Degrees of Freedom: 56 Total (i.e. Null); 47 Residual
## Null Deviance: 10.79
## Residual Deviance: 6.385 AIC: 58.98
summary(mylm2)
##
## Call:
## glm(formula = mygene ~ finannot$drugscore + finannot$cancertype,
## family = "gaussian")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6162 -0.2227 -0.0238 0.2517 0.6944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.4051 0.1651 69.071 <2e-16 ***
## finannot$drugscore 0.3312 0.1485 2.230 0.0305 *
## finannot$cancertypeCNS 0.2645 0.2235 1.183 0.2427
## finannot$cancertypeColon 0.4167 0.2173 1.918 0.0613 .
## finannot$cancertypeLeukemia 0.1420 0.2483 0.572 0.5701
## finannot$cancertypeMelanoma 0.4667 0.2102 2.220 0.0313 *
## finannot$cancertypeNSCLC 0.1691 0.2068 0.818 0.4178
## finannot$cancertypeOvarian 0.1480 0.2205 0.671 0.5054
## finannot$cancertypeProstate -0.1222 0.3101 -0.394 0.6954
## finannot$cancertypeRenal -0.2361 0.2170 -1.088 0.2822
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1358458)
##
## Null deviance: 10.7895 on 56 degrees of freedom
## Residual deviance: 6.3848 on 47 degrees of freedom
## AIC: 58.978
##
## Number of Fisher Scoring iterations: 2
plot(mylm2)
# Code to get coefficients:
summary(mylm2)$coefficients[, "Estimate"]
## (Intercept) finannot$drugscore
## 11.4050986 0.3312194
## finannot$cancertypeCNS finannot$cancertypeColon
## 0.2644965 0.4166773
## finannot$cancertypeLeukemia finannot$cancertypeMelanoma
## 0.1419963 0.4666523
## finannot$cancertypeNSCLC finannot$cancertypeOvarian
## 0.1690627 0.1479707
## finannot$cancertypeProstate finannot$cancertypeRenal
## -0.1221830 -0.2360613
summary(mylm2)$coefficients["finannot$drugscore", "Estimate"]
## [1] 0.3312194
# Code to get p-values:
summary(mylm2)$coefficients[, "Pr(>|t|)"]
## (Intercept) finannot$drugscore
## 6.500499e-49 3.054078e-02
## finannot$cancertypeCNS finannot$cancertypeColon
## 2.426703e-01 6.125147e-02
## finannot$cancertypeLeukemia finannot$cancertypeMelanoma
## 5.701025e-01 3.131324e-02
## finannot$cancertypeNSCLC finannot$cancertypeOvarian
## 4.177617e-01 5.054287e-01
## finannot$cancertypeProstate finannot$cancertypeRenal
## 6.953627e-01 2.822152e-01
summary(mylm2)$coefficients["finannot$drugscore", "Pr(>|t|)"]
## [1] 0.03054078
We can use the same gene as we did above here. Let’s run an ANOVA to look for the association between the gene abundance level, and the drug category
myanova <- aov(mygene ~ finannot$drugcateg)
summary(myanova)
## Df Sum Sq Mean Sq F value Pr(>F)
## finannot$drugcateg 2 1.431 0.7155 4.129 0.0215 *
## Residuals 54 9.358 0.1733
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Here's how to get the p-value:
summary(myanova)[[1]]["finannot$drugcateg", "Pr(>F)"]
## [1] 0.02145496
Now, we know that there is indeed a difference in expression value of the gene associated with drug category. But we have 3 drug categories, so where is the difference? Between Resistant and No Response? Between Sensitive and No Response? Or between Sensitive and Resistant? For this, we are going to calculate the Tukey Honest Significance Differences, which takes into account multiple comparisons.
mytukey <- TukeyHSD(myanova)
# To get the p-values for each group (you would replace 'finannot$drugcateg'
# by the name of your category:
mytukey$`finannot$drugcateg`[, "p adj"]
## Resistant-No Response Sensitive-No Response Sensitive-Resistant
## 0.05382105 0.69900743 0.03107014
We know that cancer type could be a confounding variable that could impact the association of the gene and the drug category, let’s run an ANCOVA and adjust for cancer type.
myanova2 <- aov(mygene ~ finannot$drugcateg + finannot$cancertype)
summary(myanova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## finannot$drugcateg 2 1.431 0.7155 4.973 0.0111 *
## finannot$cancertype 8 2.741 0.3426 2.381 0.0308 *
## Residuals 46 6.618 0.1439
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Here's how to get the p-value:
summary(myanova2)[[1]]["finannot$drugcateg", "Pr(>F)"]
## [1] 0.0110821
# Now Tukey:
mytukey2 <- TukeyHSD(myanova)
# To get the p-values for each group (you would replace 'finannot$drugcateg'
# by the name of your category:
mytukey2$`finannot$drugcateg`[, "p adj"]
## Resistant-No Response Sensitive-No Response Sensitive-Resistant
## 0.05382105 0.69900743 0.03107014
How does these results compare to the linear models that use drug as a continuous variable, not a categorical variable?
In some cases, you may have a dichotomous variable, and you wish to see the association of this dichotomous variable with an outcome.
In our case, we could dichotomize the gene levels into two categories, “hi” and “low”, where “hi” cell lines have abundance values greater than the mean. Let’s give it a try:
genemean <- mean(mygene)
genecateg <- mygene
genecateg[which(mygene < genemean)] = "low"
genecateg[which(mygene >= genemean)] = "hi"
table(genecateg)
## genecateg
## hi low
## 29 28
Important: you need to make sure that your categories are factors or else your model won’t work. The reference sample is determined by alpha-numerical order (in this case, “hi”). There are ways to change this though:
genecateg <- as.factor(genecateg)
levels(genecateg)
## [1] "hi" "low"
genecateg <- relevel(genecateg, ref = "low")
levels(genecateg)
## [1] "low" "hi"
We can use the glm() function (generalized linear models function) again but specify that we want to run a logistic model:
mylogit <- glm(genecateg ~ finannot$drugscore, family = binomial(link = "logit"))
mylogit
##
## Call: glm(formula = genecateg ~ finannot$drugscore, family = binomial(link = "logit"))
##
## Coefficients:
## (Intercept) finannot$drugscore
## 0.1078 1.2136
##
## Degrees of Freedom: 56 Total (i.e. Null); 55 Residual
## Null Deviance: 79
## Residual Deviance: 75.38 AIC: 79.38
summary(mylogit)
##
## Call:
## glm(formula = genecateg ~ finannot$drugscore, family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6649 -1.1023 0.8004 1.0886 1.5252
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1078 0.2767 0.390 0.6968
## finannot$drugscore 1.2136 0.6657 1.823 0.0683 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 79.001 on 56 degrees of freedom
## Residual deviance: 75.376 on 55 degrees of freedom
## AIC: 79.376
##
## Number of Fisher Scoring iterations: 4
plot(mylogit)
# Code to get coefficients:
summary(mylogit)$coefficients[, "Estimate"]
## (Intercept) finannot$drugscore
## 0.1078333 1.2136056
summary(mylogit)$coefficients["finannot$drugscore", "Estimate"]
## [1] 1.213606
# Code to get p-values:
summary(mylogit)$coefficients[, "Pr(>|z|)"]
## (Intercept) finannot$drugscore
## 0.69676697 0.06831394
summary(mylogit)$coefficients["finannot$drugscore", "Pr(>|z|)"]
## [1] 0.06831394
Again, we can adjust for the effect of cancer type:
mylogit2 <- glm(genecateg ~ finannot$drugscore + finannot$cancertype, family = binomial(link = "logit"))
mylogit2
##
## Call: glm(formula = genecateg ~ finannot$drugscore + finannot$cancertype,
## family = binomial(link = "logit"))
##
## Coefficients:
## (Intercept) finannot$drugscore
## -0.3864 0.3030
## finannot$cancertypeCNS finannot$cancertypeColon
## -0.2624 1.2725
## finannot$cancertypeLeukemia finannot$cancertypeMelanoma
## 1.7940 2.3425
## finannot$cancertypeNSCLC finannot$cancertypeOvarian
## 0.2282 0.2095
## finannot$cancertypeProstate finannot$cancertypeRenal
## 0.4733 -17.1146
##
## Degrees of Freedom: 56 Total (i.e. Null); 47 Residual
## Null Deviance: 79
## Residual Deviance: 58.77 AIC: 78.77
summary(mylogit2)
##
## Call:
## glm(formula = genecateg ~ finannot$drugscore + finannot$cancertype,
## family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1182 -0.9881 0.4923 0.8240 1.5632
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3864 0.9156 -0.422 0.673
## finannot$drugscore 0.3030 0.9143 0.331 0.740
## finannot$cancertypeCNS -0.2624 1.2618 -0.208 0.835
## finannot$cancertypeColon 1.2725 1.2475 1.020 0.308
## finannot$cancertypeLeukemia 1.7940 1.5733 1.140 0.254
## finannot$cancertypeMelanoma 2.3425 1.4071 1.665 0.096 .
## finannot$cancertypeNSCLC 0.2282 1.1429 0.200 0.842
## finannot$cancertypeOvarian 0.2095 1.2234 0.171 0.864
## finannot$cancertypeProstate 0.4733 1.6971 0.279 0.780
## finannot$cancertypeRenal -17.1146 1494.1310 -0.011 0.991
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 79.001 on 56 degrees of freedom
## Residual deviance: 58.767 on 47 degrees of freedom
## AIC: 78.767
##
## Number of Fisher Scoring iterations: 16
plot(mylogit2)
# Code to get coefficients:
summary(mylogit2)$coefficients[, "Estimate"]
## (Intercept) finannot$drugscore
## -0.3863637 0.3030245
## finannot$cancertypeCNS finannot$cancertypeColon
## -0.2623841 1.2725390
## finannot$cancertypeLeukemia finannot$cancertypeMelanoma
## 1.7940351 2.3424781
## finannot$cancertypeNSCLC finannot$cancertypeOvarian
## 0.2281633 0.2095381
## finannot$cancertypeProstate finannot$cancertypeRenal
## 0.4732513 -17.1145544
summary(mylogit2)$coefficients["finannot$drugscore", "Estimate"]
## [1] 0.3030245
# Code to get p-values:
summary(mylogit2)$coefficients[, "Pr(>|z|)"]
## (Intercept) finannot$drugscore
## 0.67303616 0.74031070
## finannot$cancertypeCNS finannot$cancertypeColon
## 0.83526891 0.30768899
## finannot$cancertypeLeukemia finannot$cancertypeMelanoma
## 0.25416655 0.09596663
## finannot$cancertypeNSCLC finannot$cancertypeOvarian
## 0.84177172 0.86400264
## finannot$cancertypeProstate finannot$cancertypeRenal
## 0.78035100 0.99086081
summary(mylogit2)$coefficients["finannot$drugscore", "Pr(>|z|)"]
## [1] 0.7403107
Is this what you would expect? Is it better to dichotomize your outcome or keep the values as continuous?
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.1 magrittr_1.5 formatR_1.7 tools_3.6.1
## [5] htmltools_0.3.6 yaml_2.2.0 Rcpp_1.0.2 stringi_1.4.3
## [9] rmarkdown_1.14 knitr_1.23 stringr_1.4.0 xfun_0.8
## [13] digest_0.6.20 evaluate_0.14